ABSTRACT

The sequential broken stick model has appeared in numerous contexts,
including biology, physics, engineering and geology. Kolmogorov showed
that under appropriate conditions, sequential breakage processes often
yield a lognormal distribution of particle sizes. Of particular interest
to ecologists is the observed variance of the logarithms of the sizes,
which characterizes the evenness of an assemblage of species. We derive
the first two moments for the logarithms of the sizes in terms of the
underlying distribution used to determine the successive breakages. In
particular, for a process yielding n pieces, the expected sample variance
grows asymptotically in proportion to log(n). These results also yield a
new identity for moments of path lengths in random binary trees.
1.  INTRODUCTION


When a distributional pattern is generated by an unobservable process,
insight into the mechanism of genesis can sometimes be gained with the
aid of a suitable model. Our focus here will be on lognormal distributions,
with the aim of studying a class of processes that give rise to
empirical families of lognormal curves. The importance of this pattern
rests largely on its ubiquity and the broad spectrum of contexts in which
it appears.

In engineering and geology lognormal distributions have been used
to describe quantities produced by natural and mechanical processes, such
as frequencies of particle sizes and life lengths of materials and
machines before failure (Epstein, 1947; Herdan, 1953). In economics and
sociology these distributions have been fitted to data on incomes (Gibrat,
1931; Davies, 1946; Kapteyn, 1916) and numbers of people per occupation
(Clark, 1964). Applications in biology include characterizing data on
body sizes (Yuan, 1933; Camp, 1938; Cramer, 1946) and species abundance
(Preston, 1962; Aitchison and Brown, 1968; Patrick, 1968; Bulmer, 1974;
May, 1975; Pielou, 1975; Sugihara, 1980). Brown and Sanders (1981) have
shown that the lognormal distribution arises in a large variety of
classification procedures.

Some of these physical and biological contexts, where the natural
method of genesis involves repeated breakages, produce special families
of lognormal distributions. Kolmogorov (1941) has shown that when the
frequency of breakage is independent of the size of each particle, the
distribution of particle sizes is asymptotically lognormal.
Of interest here is that the mean and variance will depend on the number
of breakages applied. These parameters will therefore be coupled to the
number of particles generated, producing families of lognormal
distributions in which the variance of log sizes will increase with the
application of additional breakage events.

When such coupling is actually observed, it may suggest sequential
breakage as a possible method of genesis. This argument was used, for
example, in a recent model of species abundance (Sugihara, 1980) in which
sequential binary breakage of niche space was proposed to explain a
particular coupling of parameters observed in the lognormal species
abundance distribution (Preston, 1962). Investigating sequential breakage
mechanisms may be useful not only for clarifying the predictions of this
species abundance model, but also, more generally, for understanding the
genesis of empirical families of lognormal curves having coupled parameters.

Monte Carlo estimates have been available (Sugihara, 1980) for the
relationship between the expected variance and the number of particles
(or species) in some special cases of repeated binary breakage. Our aim
here is to provide exact and asymptotic formulae for this relationship.
In addition to simplifying computation, these results will also yield
further insight into the nature of breakage processes. In particular,
we will show how the expected mean and variance of the logarithmic sizes
can be expressed in terms of auxiliary moments of the distribution of
breakage applied at each step. Underlying these results is a somewhat
surprising identity involving cross-moments of path lengths in random
binary trees.
2.  EXPECTED  MEANS  AND  VARIANCES

Begin with a stick of unit length and a breakage distribution F
that is symmetric on (0,1). This is stage 1; at stage n the stick
will be broken into n pieces. To go from stage n to stage n+1,
first choose a piece at random (uniformly without regard to size, so that
each piece has probability 1/n of being chosen) and then break it in
two according to a proportion chosen independently from F. Because the
piece to be broken is chosen randomly in this way, we lose no generality
by requiring F to be symmetric, which simplifies the mathematical
treatment.
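
The breakage scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `break_stick` and `uniform_proportion` are hypothetical, and the uniform distribution on (0,1) is merely one convenient symmetric choice of F.

```python
import random

def break_stick(n, draw_proportion, rng):
    """Run the sequential breakage model up to stage n and return the n piece sizes."""
    sizes = [1.0]                      # stage 1: the unbroken unit stick
    while len(sizes) < n:
        i = rng.randrange(len(sizes))  # choose a piece uniformly, ignoring its size
        w = draw_proportion(rng)       # breakage proportion W drawn from F
        piece = sizes.pop(i)
        sizes.extend([piece * w, piece * (1.0 - w)])
    return sizes

def uniform_proportion(rng):
    """One symmetric choice of F (an assumption): the uniform distribution on (0, 1)."""
    return rng.random()

rng = random.Random(0)                 # fixed seed so the run is reproducible
pieces = break_stick(10, uniform_proportion, rng)
print(len(pieces), sum(pieces))
```

Each break replaces one piece by two whose sizes add up to the original, so the sizes always sum to the unit length of the stick.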

If W is an observation from F, define moments $\mu_n = E\{[\log(W)]^n\}$
and $\nu = E[\log(W)\log(1-W)]$. We will assume that $\mu_1$ and $\mu_2$ are
finite. At stage n, let the pieces have sizes $X_{1n}, \ldots, X_{nn}$ in some
random ordering so that these are exchangeable (but not independent)
random variables. The logarithms of the sizes, $U_{in} = \log(X_{in})$, are of
interest in many applications, as are the sample mean
$\bar{U}_n = (\sum_{i=1}^{n} U_{in})/n$ and the sample variance
$S_n^2 = (\sum_{i=1}^{n} (U_{in} - \bar{U}_n)^2)/(n-1)$. Note that each
$X_{in}$ is, marginally, the product of a random number of independent
proportions chosen from F, and each $U_{in}$ is the sum of the corresponding
logarithms.


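Theorem 1. The mean and variance of the logarithm, $U_{1n}$, of the
size of a single random piece at stage n are

$$E(U_{1n}) = 2\mu_1 \sum_{k=2}^{n} \frac{1}{k} \tag{1}$$

$$\mathrm{Var}(U_{1n}) = 2\mu_2 \sum_{k=2}^{n} \frac{1}{k} - 4\mu_1^2 \sum_{k=2}^{n} \frac{1}{k^2} \tag{2}$$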


Proof. By exchangeability, it will suffice to compute $E(U_{1n})$.
Condition according to whether this piece was or was not involved in the
most recent breakage, events with probabilities 2/n and (n-2)/n
respectively. Because the conditional distributions of the log lengths are
those of $U_{1,n-1} + \log(W)$ and $U_{1,n-1}$ respectively, where W has
distribution F and is independent of $U_{1,n-1}$, we obtain the recurrence

$$E(U_{1n}) = E(U_{1,n-1}) + \frac{2}{n}\mu_1 \tag{3}$$

whose solution with initial condition $E(U_{11}) = 0$ is (1). For the
variance, $\mathrm{Var}(U_{1n}) = E(U_{1n}^2) - [E(U_{1n})]^2$, condition as
before for each term of this difference, to establish

$$E(U_{1n}^2) = E(U_{1,n-1}^2) + \frac{4}{n}\mu_1 E(U_{1,n-1}) + \frac{2}{n}\mu_2 \tag{4}$$

and

$$[E(U_{1n})]^2 = [E(U_{1,n-1})]^2 + \frac{4}{n}\mu_1 E(U_{1,n-1}) + \frac{4}{n^2}\mu_1^2 \tag{5}$$

Subtracting (5) from (4) and simplifying, we obtain

$$\mathrm{Var}(U_{1n}) = \mathrm{Var}(U_{1,n-1}) + \frac{2}{n}\mu_2 - \frac{4}{n^2}\mu_1^2 \tag{6}$$

whose solution with initial condition $\mathrm{Var}(U_{11}) = 0$ is (2). □
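
As a numerical sanity check on Theorem 1, the sketch below iterates the recurrence (3) for the mean together with the second-moment recurrence, which we take to read $E(U_{1n}^2) = E(U_{1,n-1}^2) + (4/n)\mu_1 E(U_{1,n-1}) + (2/n)\mu_2$, and compares the result with the closed forms (1) and (2). The function names are hypothetical, and the two-point F (mass 1/2 at 1/4 and at 3/4) is borrowed from the example given later in this section.

```python
import math

def closed_form_moments(n, mu1, mu2):
    """Mean (1) and variance (2) of U_1n, evaluated from the closed-form sums."""
    h1 = sum(1.0 / k for k in range(2, n + 1))
    h2 = sum(1.0 / k ** 2 for k in range(2, n + 1))
    return 2.0 * mu1 * h1, 2.0 * mu2 * h1 - 4.0 * mu1 ** 2 * h2

def recurrence_moments(n, mu1, mu2):
    """Mean and variance of U_1n by iterating the recurrences from U_11 = 0."""
    mean, second = 0.0, 0.0
    for m in range(2, n + 1):
        # The second-moment step needs the stage m-1 mean, so update it first.
        second += (4.0 / m) * mu1 * mean + (2.0 / m) * mu2
        mean += (2.0 / m) * mu1
    return mean, second - mean ** 2

# Two-point F: mass 1/2 at 1/4 and at 3/4.
mu1 = 0.5 * (math.log(0.25) + math.log(0.75))
mu2 = 0.5 * (math.log(0.25) ** 2 + math.log(0.75) ** 2)

closed = closed_form_moments(50, mu1, mu2)
iterated = recurrence_moments(50, mu1, mu2)
print(closed, iterated)
```

The two routes should agree up to floating-point rounding, since the closed forms are exactly the telescoped recurrences.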

In applications, care must be taken with regard to the variance
term. Because of the dependence among $U_{1n}, \ldots, U_{nn}$, the sample
variance $S_n^2$ should not be compared to (2), which is the expected sample
variance for an independent sample with the same marginals. Instead, the
following quantities should be used.


Theorem 2. The expected sample mean and variance at stage n are

$$E(\bar{U}_n) = 2\mu_1 \sum_{k=2}^{n} \frac{1}{k} \tag{7}$$

$$E(S_n^2) = \left\{ 2\left(1 + \frac{2}{n-1}\right) \sum_{k=2}^{n} \frac{1}{k} - 2 \right\} \mu_2 - \nu \tag{8}$$


Proof. Equation (7) follows by linearity from (1). For (8), expand
and use exchangeability to obtain

$$E(S_n^2) = E(U_{1n}^2 - U_{1n}U_{2n}) \tag{9}$$

Next, condition according to the four possibilities of involvement of
$U_{1n}$ and $U_{2n}$ in the most recent breakage (neither, $U_{1n}$ only,
$U_{2n}$ only, or both). For example, with probability 2/(n(n-1)) they were
both involved in the most recent breakage, and $U_{1n}^2 - U_{1n}U_{2n}$ has
the conditional distribution of

$$\{[U_{1,n-1} + \log(W)]^2 - [U_{1,n-1} + \log(W)][U_{1,n-1} + \log(1-W)]\} \tag{10}$$

where W is a random variable with distribution F and is independent
of $U_{1,n-1}$. Combining this with the results from the other three
possibilities, then simplifying, we find with some effort that

$$E(S_n^2) = \frac{(n+1)(n-2)}{n(n-1)} E(S_{n-1}^2) + \frac{2\mu_2}{n} - \frac{2\nu}{n(n-1)} \tag{11}$$

With some patience, it can be shown by induction that (8) is the solution
to the recurrence (11) with initial condition $E(S_2^2) = \mu_2 - \nu$. □
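
The induction can also be checked mechanically. The sketch below assumes the recurrence (11) has the form $E(S_n^2) = \frac{(n+1)(n-2)}{n(n-1)} E(S_{n-1}^2) + \frac{2\mu_2}{n} - \frac{2\nu}{n(n-1)}$, writes $E(S_n^2) = c_n \mu_2 + d_n \nu$, and compares the coefficients produced by the recurrence with those read off from (8), using exact rational arithmetic. The function names are hypothetical.

```python
from fractions import Fraction

def closed_form_coeffs(n):
    """Coefficients (c, d) with E(S_n^2) = c*mu_2 + d*nu, read off from (8)."""
    harmonic_tail = sum(Fraction(1, k) for k in range(2, n + 1))
    c = 2 * (1 + Fraction(2, n - 1)) * harmonic_tail - 2
    return c, Fraction(-1)

def recurrence_coeffs(n):
    """The same coefficients by iterating (11) from E(S_2^2) = mu_2 - nu."""
    c, d = Fraction(1), Fraction(-1)
    for m in range(3, n + 1):
        factor = Fraction((m + 1) * (m - 2), m * (m - 1))
        c = factor * c + Fraction(2, m)
        d = factor * d - Fraction(2, m * (m - 1))
    return c, d

for n in (2, 3, 10, 50):
    assert closed_form_coeffs(n) == recurrence_coeffs(n)

print(closed_form_coeffs(3))
```

Because both computations use exact fractions, agreement here is an identity for the tested values of n rather than a floating-point coincidence. At n = 3, for instance, the coefficient of $\mu_2$ is 4/3.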


The means (1) and (7) are identical because the expectation operator
is linear even under dependence. However, from (2) and (8) we see that the
lack of independence has modified the variance. These differences can be
studied in detail by examining an asymptotic expansion of each expression.


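Theorem 3.

$$E(U_{1n}) = E(\bar{U}_n) = \mu_1 \left\{ 2\log(n) - .8456 + \frac{1}{n} - \frac{1}{6n^2} + \frac{1}{60n^4} + O\!\left(\frac{1}{n^6}\right) \right\} \tag{12}$$

$$\mathrm{Var}(U_{1n}) = 2\mu_2\log(n) - (2.5797\mu_1^2 + .8456\mu_2) + \frac{4\mu_1^2 + \mu_2}{n} - \frac{12\mu_1^2 + \mu_2}{6n^2} + \frac{2\mu_1^2}{3n^3} + \frac{\mu_2}{60n^4} - \frac{2\mu_1^2}{15n^5} + O\!\left(\frac{1}{n^6}\right) \tag{13}$$

$$E(S_n^2) = 2\mu_2\log(n) - (\nu + 2.8456\mu_2) + \frac{4\mu_2\log(n)}{n-1} - \frac{.6911\mu_2}{n} + \frac{.1422\mu_2}{n^2} - \frac{.02447\mu_2}{n^3} - \frac{.007804\mu_2}{n^4} + \frac{.008863\mu_2}{n^5} + O\!\left(\frac{1}{n^6}\right) \tag{14}$$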


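Proof. These follow from two standard asymptotic expansions. Equation
(12) depends on the expansion of the harmonic series (e.g. Knuth, 1973,
Vol. 1, p. 74):

$$\sum_{k=1}^{n} \frac{1}{k} = \log(n) + \gamma + \frac{1}{2n} - \frac{1}{12n^2} + \frac{1}{120n^4} + O\!\left(\frac{1}{n^6}\right) \tag{15}$$

where $\gamma = .5772156649$ is Euler's constant. Equation (13) also uses
(e.g. p. 61 of Hansen, 1975)

$$\sum_{k=1}^{n} \frac{1}{k^2} = \frac{\pi^2}{6} - \frac{1}{n} + \frac{1}{2n^2} - \frac{1}{6n^3} + \frac{1}{30n^5} + O\!\left(\frac{1}{n^7}\right) \tag{16}$$

For (14), use (15) in (8) and multiply the nonlogarithmic part by the
expansion of 2/(n-1) as a power series in 1/n. Combine terms to find

$$E(S_n^2) = \left\{ 2\log(n) + (2\gamma - 4) + \frac{4\log(n)}{n-1} + \frac{4\gamma - 3}{n} + \frac{24\gamma - 13}{6n^2} + \frac{240\gamma - 140}{60n^3} + \frac{240\gamma - 139}{60n^4} + \frac{240\gamma - 138}{60n^5} \right\} \mu_2 - \nu + O\!\left(\frac{1}{n^6}\right)$$

which evaluates to (14).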
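
The two standard expansions used in this proof are easy to test numerically. The sketch below compares the truncated series, as we read them, with direct summation. The function names are hypothetical and the remainder terms are simply dropped.

```python
import math

EULER_GAMMA = 0.5772156649  # value of Euler's constant quoted in the text

def harmonic_expansion(n):
    """Truncated expansion of sum_{k=1}^n 1/k, remainder O(1/n^6) dropped."""
    return (math.log(n) + EULER_GAMMA + 1 / (2 * n)
            - 1 / (12 * n ** 2) + 1 / (120 * n ** 4))

def inverse_square_expansion(n):
    """Truncated expansion of sum_{k=1}^n 1/k^2, remainder dropped."""
    return (math.pi ** 2 / 6 - 1 / n + 1 / (2 * n ** 2)
            - 1 / (6 * n ** 3) + 1 / (30 * n ** 5))

def expansion_errors(n):
    """Differences between direct summation and the truncated expansions."""
    exact_h = sum(1 / k for k in range(1, n + 1))
    exact_q = sum(1 / k ** 2 for k in range(1, n + 1))
    return exact_h - harmonic_expansion(n), exact_q - inverse_square_expansion(n)

for n in (5, 50, 500):
    print(n, expansion_errors(n))
```

Even at n = 5 the discrepancies are below one part in a million, which suggests that the truncated forms of (12) to (14) are adequate for the moderate values of n met in practice.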


Although the leading terms in the variances (13) and (14) are identical,
the differences in their second terms cannot be neglected even for
moderately large n, because log(n) increases so slowly.
For example, when F places mass 1/2 at 1/4 and at 3/4, even with
n as large as 50, we have $\mathrm{Var}(U_{1n}) \approx 5.3$ while
$E(S_n^2) \approx 4.9$.
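
The figures in this example can be reproduced directly from the exact formulas (2) and (8). The sketch below is an illustration only, with hypothetical function names. For the two-point F the moments are computed by averaging over the two atoms, and $\nu$ reduces to $\log(1/4)\log(3/4)$ because $\log(W)\log(1-W)$ takes the same value at both atoms.

```python
import math

def var_single_piece(n, mu1, mu2):
    """Var(U_1n) from (2)."""
    h1 = sum(1.0 / k for k in range(2, n + 1))
    h2 = sum(1.0 / k ** 2 for k in range(2, n + 1))
    return 2.0 * mu2 * h1 - 4.0 * mu1 ** 2 * h2

def expected_sample_variance(n, mu2, nu):
    """E(S_n^2) from (8)."""
    h1 = sum(1.0 / k for k in range(2, n + 1))
    return (2.0 * (1.0 + 2.0 / (n - 1)) * h1 - 2.0) * mu2 - nu

# F places mass 1/2 at 1/4 and at 3/4.
a, b = math.log(0.25), math.log(0.75)
mu1 = 0.5 * (a + b)
mu2 = 0.5 * (a * a + b * b)
nu = a * b  # log(W) * log(1 - W) is the same at both atoms

for n in (10, 50, 1000):
    print(n, round(var_single_piece(n, mu1, mu2), 2),
          round(expected_sample_variance(n, mu2, nu), 2))
```

At n = 50 the two quantities come out near 5.26 and 4.90, in agreement with the rounded values quoted above.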


3.  AN  IDENTITY  FOR  RANDOM  BINARY  TREES 

Consider the class of random binary trees with n endpoints generated
recursively by bifurcating an endpoint chosen uniformly at random from a
tree with n-1 endpoints. These trees are responsible for part of the
randomness of the sequential breakage model (the other component can be
thought of as entering through the distribution F). These trees are
related to random binary search trees used in computer science (Knuth,
1973, Vol. 3, p. 423-471). For a tree with n endpoints, let $N_{1n}$ and
$N_{2n}$ denote the distances (in numbers of edges) from each of two
randomly chosen endpoints to their nearest common ancestor, as illustrated
in the figure.

Although moments involving $N_{1n}$ and $N_{2n}$ generally increase with n,
there is an expression for which this dependence cancels out.

Theorem 4. Regardless of the value of n,

$$E(N_{1n}) - E(N_{1n}^2) + E(N_{1n}N_{2n}) = 1$$

Proof. Proceed by induction on n, conditioning on the four events
describing which of $N_{1n}$ and $N_{2n}$ were involved in the most recent
bifurcation. This is similar to the proof of Theorem 2. □
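
For small n the identity can be verified exactly by enumerating every growth history of the tree. The sketch below rests on an assumption: we take the identity to be $E(N_{1n}) - E(N_{1n}^2) + E(N_{1n}N_{2n}) = 1$, which is the combination for which the $\mu_1^2$ terms vanish from $E(S_n^2)$. Leaves are represented by their root-to-leaf paths, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def grow_trees(n):
    """Yield the leaf paths of the tree produced by each equally likely growth history."""
    def extend(leaves):
        if len(leaves) == n:
            yield leaves
            return
        for i, path in enumerate(leaves):
            # Bifurcate leaf i: replace it by its two children.
            children = [path + (0,), path + (1,)]
            yield from extend(leaves[:i] + leaves[i + 1:] + children)
    yield from extend([()])

def identity_value(n):
    """Exact value of E(N1) - E(N1^2) + E(N1*N2) for trees with n >= 2 endpoints."""
    total, count = Fraction(0), 0
    for leaves in grow_trees(n):
        for a, b in permutations(leaves, 2):   # ordered pairs of distinct endpoints
            shared = 0
            while a[shared] == b[shared]:      # depth of the nearest common ancestor
                shared += 1
            n1, n2 = len(a) - shared, len(b) - shared
            total += n1 - n1 * n1 + n1 * n2
            count += 1
    return total / count

for n in range(2, 7):
    print(n, identity_value(n))
```

There are (n-1)! growth histories at stage n, so this brute-force check is practical only for small n, but it is exact and does not depend on the distribution F.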